Data collection and annotation for state-of-the-art NER using unmanaged crowds
Authors
Abstract
This paper presents strategies for generating entity-level annotated text utterances using unmanaged crowds. These utterances are then used to build state-of-the-art Named Entity Recognition (NER) models, a required component for building dialogue systems. First, a wide variety of raw utterances is collected through a variant elicitation task. We ensure that these utterances are relevant by feeding them back to the crowd for a domain validation task. We also flag utterances with potential spelling errors and verify these errors with the crowd before discarding them. These strategies, combined with a periodic CAPTCHA to prevent automated responses, allow us to collect high-quality text utterances despite the inability to use the traditional gold test question approach for spam filtering. These utterances are then tagged with appropriate NER labels using unmanaged crowds. The crowd annotation was 23% more accurate and 29% more consistent than in-house annotation.
Similar resources
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection
The state-of-the-art named entity recognition (NER) systems are supervised machine learning models that require large amounts of manually annotated data to achieve high accuracy. However, annotating NER data by human is expensive and time-consuming, and can be quite difficult for a new language. In this paper, we present two weakly supervised approaches for cross-lingual NER with no human annot...
SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data
We present SWELLSHARK, a framework for building biomedical named entity recognition (NER) systems quickly and without hand-labeled data. Our approach views biomedical resources like lexicons as function primitives for autogenerating weak supervision. We then use a generative model to unify and denoise this supervision and construct large-scale, probabilistically labeled datasets for training hi...
Harnessing Diversity in Crowds and Machines for Better NER Performance
Over the last years, information extraction tools have gained a great popularity and brought significant performance improvement in extracting meaning from structured or unstructured data. For example, named entity recognition (NER) tools identify types such as people, organizations or places in text. However, despite their high F1 performance, NER tools are still prone to brittleness due to th...
State of the art in Turkish Named Entity Recognition
Named entity recognition (NER), which provides useful information for many high level NLP applications and semantic web technologies, is a well-studied topic for most of the languages and especially for English. However the studies for Turkish, which is a morphologically richer and lesser-studied language, have fallen behind these for a long while. In recent years, Turkish NER intrigued researc...